Use generation in pod disruption budget #36438
Conversation
// Most recent generation observed when updating this PDB status. PodDisruptionsAllowed and other
// status information is valid only if observedGeneration equals the PDB's object generation.
// +optional
ObservedGeneration *int64 `json:"observedGeneration,omitempty" protobuf:"varint,1,opt,name=observedGeneration"`
Why is this a pointer?
So that there is no accidental generation equality if observedGeneration/generation is not set. I followed the pattern we had with HPA.
Hmm, on the other hand ReplicaSetStatus and DeploymentStatus use just int...
Yes, there is no need for it to be a pointer.
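The zero-value argument can be sketched with a hypothetical, trimmed-down struct (the real types live in the policy API group; `pdb` and `statusCurrent` here are illustrative only). Because Generation starts at 1 for persisted objects, a plain int64 whose zero value means "status never written" can never accidentally equal it:

```go
package main

import "fmt"

// Hypothetical, simplified mirror of the fields under discussion;
// not the real API types.
type pdb struct {
	Generation         int64 // starts at 1 for persisted objects
	ObservedGeneration int64 // zero value means "status never written"
}

// statusCurrent reports whether the status was computed from the latest spec.
func statusCurrent(p pdb) bool {
	return p.ObservedGeneration == p.Generation
}

func main() {
	fmt.Println(statusCurrent(pdb{Generation: 1}))                        // false: status never observed
	fmt.Println(statusCurrent(pdb{Generation: 3, ObservedGeneration: 3})) // true
}
```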
@@ -61,7 +54,19 @@ type PodDisruptionBudgetStatus struct {
	// the list automatically by PodDisruptionBudget controller after some time.
	// If everything goes smooth this map should be empty for the most of the time.
	// Large number of entries in the map may indicate problems with pod deletions.
-	DisruptedPods map[string]unversioned.Time `json:"disruptedPods" protobuf:"bytes,5,rep,name=disruptedPods"`
+	DisruptedPods map[string]unversioned.Time `json:"disruptedPods" protobuf:"bytes,2,rep,name=disruptedPods"`
You can't change order of fields, as this affects protobuf representation.
Discussed offline - actually since this wasn't yet released, you probably can do it.
@@ -116,6 +116,10 @@ func (r *EvictionREST) Create(ctx api.Context, obj runtime.Object) (runtime.Obje
}

func (r *EvictionREST) checkAndDecrement(namespace string, podName string, pdb policy.PodDisruptionBudget) (ok bool, err error) {
	if pdb.Status.ObservedGeneration == nil || *pdb.Status.ObservedGeneration != pdb.Generation {
		return false, fmt.Errorf("pdb status has not been updated after spec change")
I'm wondering if this should be an error, because this is a situation that will happen in a real system. It just means that we haven't updated the PDB yet. So I think that we should:
return false, nil
here.
For me, an error is for something that is not expected, and this can definitely happen.
The eviction needs to be retried until observedGeneration matches generation. Two options I see: either poll for some time here or force the request to be retried (by returning a ServerTimeoutError)
Or both.
@@ -116,6 +116,10 @@ func (r *EvictionREST) Create(ctx api.Context, obj runtime.Object) (runtime.Obje
}

func (r *EvictionREST) checkAndDecrement(namespace string, podName string, pdb policy.PodDisruptionBudget) (ok bool, err error) {
	if pdb.Status.ObservedGeneration == nil || *pdb.Status.ObservedGeneration != pdb.Generation {
There is nothing that bumps Generation since there is no spec change happening so it will always be 1 for PDBs. You will either need to bump Generation here when you do the update or you should probably use a different pair of Generation/ObservedGeneration because Generation is meant to be bumped only on spec changes.
I talked with @wojtek-t and we initially agreed that there is optimistic locking implemented in client/etcd that would make the extra generation not needed. However now he has some second thoughts how it exactly works and if there are retries on version mismatch. We are checking it right now.
We just need to set unconditionalUpdate to false https://github.com/kubernetes/kubernetes/blob/master/pkg/registry/generic/registry/store.go#L370
I don't understand how this solves the issue with Generation not being incremented by anything atm.
If I understand correctly, each object has a version that is incremented on every update. It is possible to allow an update only if the internal etcd version equals the version in the update. With that, no two concurrent updates would be able to use the same budget at the same time: by the time the first one gets in, the version is incremented and the second one fails.
Though - ObservedGeneration doesn't provide any value now (as @Kargakis observed, because spec doesn't change) - so we should just remove it.
It does. It protects us against unsynced spec changes.
I don't think the unconditionalUpdate change helps because all it does is to allow unconditional updates if the incoming resource version is zero, right?
Not really. If unconditionalUpdate is false then we go into the "else" branch:
https://github.com/kubernetes/kubernetes/blob/master/pkg/registry/generic/registry/store.go#L408
where we face this "if":
https://github.com/kubernetes/kubernetes/blob/master/pkg/registry/generic/registry/store.go#L421
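The compare-and-swap behaviour being discussed can be modeled with a minimal in-memory sketch: a write is accepted only when the caller's resourceVersion matches the stored one, so two evictions computed from the same PDB snapshot cannot both consume the last allowed disruption. `fakeStore` and `decrementAllowed` are illustrative, not the real registry code:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// fakeStore models an etcd-backed object: rv is bumped on every
// successful write, and stale writers are rejected.
type fakeStore struct {
	mu      sync.Mutex
	rv      int64 // resourceVersion
	allowed int   // PodDisruptionsAllowed
}

var errConflict = errors.New("update conflict: resourceVersion mismatch")

// decrementAllowed consumes one allowed disruption, but only if the
// caller's snapshot (seenRV) is still current.
func (s *fakeStore) decrementAllowed(seenRV int64) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if seenRV != s.rv {
		return errConflict // stale snapshot: caller must re-read and retry
	}
	if s.allowed <= 0 {
		return errors.New("disruption budget exhausted")
	}
	s.allowed--
	s.rv++
	return nil
}

func main() {
	s := &fakeStore{rv: 4, allowed: 1}
	fmt.Println(s.decrementAllowed(4)) // first eviction wins; rv becomes 5
	fmt.Println(s.decrementAllowed(4)) // second, based on the same snapshot, conflicts
}
```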
Please don't add "retest-not-required" in such large PRs
Force-pushed from a8d1022 to f8991c1 (compare).
Jenkins unit/integration failed for commit f8991c1. Full PR test history. The magic incantation to run this job again is
This generally LGTM - but please fix tests as the failure is related.
@@ -116,6 +116,9 @@ func (r *EvictionREST) Create(ctx api.Context, obj runtime.Object) (runtime.Obje
}

func (r *EvictionREST) checkAndDecrement(namespace string, podName string, pdb policy.PodDisruptionBudget) (ok bool, err error) {
	if pdb.Status.ObservedGeneration != pdb.Generation {
Nothing increments Generation, hence this condition will always be false once the controller observes a PDB for the first time.
This is a good point. I thought that Generation was automatically incremented by our infrastructure on every Spec update, but it seems this has to be done manually. For example, RC is doing it here:
https://github.com/kubernetes/kubernetes/blob/master/pkg/registry/core/controller/strategy.go#L81
@mwielgus - this should be added.
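The strategy-level bump can be sketched as a hypothetical, trimmed-down version of such a PrepareForUpdate hook (modeled loosely on the linked RC strategy; `spec`, `object`, and `prepareForUpdate` are illustrative names): Generation is bumped manually, and only when the spec actually changed.

```go
package main

import "fmt"

// Illustrative stand-ins for an API object with a mutable spec.
type spec struct{ MinAvailable int }

type object struct {
	Generation int64
	Spec       spec
}

// prepareForUpdate mimics a registry strategy hook: carry the old
// Generation forward, and bump it only on a spec change.
func prepareForUpdate(newObj, oldObj *object) {
	newObj.Generation = oldObj.Generation
	if newObj.Spec != oldObj.Spec {
		newObj.Generation = oldObj.Generation + 1
	}
}

func main() {
	old := &object{Generation: 1, Spec: spec{MinAvailable: 2}}
	upd := &object{Spec: spec{MinAvailable: 3}}
	prepareForUpdate(upd, old)
	fmt.Println(upd.Generation) // spec changed, so the generation is bumped to 2
}
```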
It's already there for PDB but I thought the spec is immutable.
newPodDisruptionBudget.Generation = oldPodDisruptionBudget.Generation + 1
Aah ok, thanks.
Why do you say that spec is immutable? Is there some validation rule that disallows updates?
Why do you say that spec is immutable? Is there some validation rule that disallows updates?
Currently yes:
allErrs = append(allErrs, field.Forbidden(field.NewPath("spec"), "updates to poddisruptionbudget spec are forbidden."))
@@ -91,9 +91,10 @@ func (podDisruptionBudgetStrategy) ValidateUpdate(ctx api.Context, obj, old runt
	return append(validationErrorList, updateErrorList...)
}

-// AllowUnconditionalUpdate is the default update policy for PodDisruptionBudget objects.
+// AllowUnconditionalUpdate is the default update policy for PodDisruptionBudget objects. Status update should
AllowUnconditionalUpdate is used only if the resource version in the UPDATE request is zero:
doUnconditionalUpdate := resourceVersion == 0 && e.UpdateStrategy.AllowUnconditionalUpdate()
How is this helping?
newVersion, err := e.Storage.Versioner().ObjectResourceVersion(obj)
Ok, seems like a small improvement but we already use a resource version in the eviction REST so you would still get an update conflict on a different rv.
The eviction REST is per pod. When the pods are different there is no conflict.
I wasn't talking about pods but about the PDB. Even if AllowUnconditionalUpdate returns true, if the eviction REST tries to update the status of a PDB with an rv older than what etcd has, you will get an update conflict.
Note that I am fine with this change but it doesn't solve the issue that is solved by Generation/ObservedGeneration.
Force-pushed from f8991c1 to ada53fa (compare).
Test fixed.
@Kargakis Are you ok with merging this PR?
In its current form, what you have is not going to work. There are two issues at play: 1) you need the eviction REST to increment pdb.Generation every time it updates the status of the PDB, but 2) Generation is supposed to be incremented ONLY on spec changes, so you essentially change its meaning. I am proposing a separate set of fields (EvictionGeneration / EvictionObservedGeneration) that will be specific to the Eviction update. I think @bgrant0607 has proposed somewhere that every component that updates an object and wants to observe its update should have its own generation/observedGeneration pair.
Ref #7328 (comment)
Honestly I don't understand why exactly it won't work. The object version (along with the policy) should guarantee that no two PDB status updates occur at the same time, because the second one would have an incorrect version. This is a compare-and-swap type of operation. Any extra field or pair of fields that had to be handled by both the apiserver and the controller on every update would basically mean a full round trip from the API server to the controller and back, and would make PDB really slow.
This should already be fixed in master (the eviction REST is not setting rv to zero, so it never does an unconditional update for a PDB). You still don't guarantee that what the eviction REST sees at a given point in time for a PDB (specifically pdb.status.PodDisruptionsAllowed) is true, in other words observed by the PDB controller.
Yes, there is no guarantee that what the eviction handler sees is true, because a change could have occurred just a second ago and not yet been noticed by the PDB controller. The question is whether, assuming perfect status setting by the controller, eviction handlers would behave properly, in the sense that no unallowed disruption is made. I think that this PR makes that possible.
Assuming all pod deletions go through eviction, generation/observedGeneration guarantees that what eviction sees is true. If a change (meaning eviction) occurs, pdb.{status,spec}.evictionGeneration is incremented and the next eviction will need to wait for pdb.status.observedEvictionGeneration to match the new pdb.status.evictionGeneration. I can't think of any other way to synchronize the API server and the PDB controller.
I am not sure of the additional value at this point.
@Kargakis When I'm talking about version I'm not referring to generation or observed generation but to the resource version that is inside every object and which is increased by etcd on every update that happens to the object.
6A) etcd receives the update, checks that the current version is 4, accepts the update, and bumps the version to 5. The corresponding pod is deleted. 7B) The handler gets the PDB in version 5 and checks whether the eviction is possible. It is not. A negative response is returned to the client. ObservedGeneration is only needed to ensure that the new/updated spec was noticed by the controller.
I would suggest having @bgrant0607 or @lavalamp review this. I thought the idea of generation was that the PDB controller should set observed generation == current generation (TBH I don't remember how current generation is represented in general -- is it just the resource version on the controller?) when its status reflects the effects of its spec. So for example it prevents you from using Status.PodDisruptionsAllowed right after you change MinAvailable or Selector, until Status.PodDisruptionsAllowed reflects the new value of MinAvailable or Selector. That doesn't seem to be what you're doing here. But I admit I may be misunderstanding the observed generation stuff.
@davidopp |
This should already happen in master, because in order to do an unconditional update you also need to pass a zero rv, but the eviction REST does a normal (conditional) update.
Yes, this already happens in this PR. It doesn't fix #35324 though, because the eviction REST cannot be sure about PodDisruptionsAllowed at any given point (spec updates to the PDB are not going to happen often anyway). I imagine that requiring every eviction to wait for all prior evictions to be observed by the controller will be really slow. We could potentially allow some amount of deviation (for example, proceed with the eviction if evictionGeneration <= evictionObservedGeneration + minAvailable).
Could you please describe a scenario in which PodDisruptionsAllowed is incorrect?
Unless all deletes go through the eviction REST, PodDisruptionsAllowed may end up out of sync easily because the PDB controller is using caches and it may lag behind (even a lot in some corner cases). Example:
PodDisruptionBudget will not stay at 5, because a PDB update is also triggered on pod status change. Yes, for a brief moment the PDB status will not be accurate, but you have no way to prevent that.
That's the part where I am talking about the caches the PDB controller is using. The replica set and pod caches are going to have a ton of objects, so eventually they will start lagging. What is the plan with the eviction API? I know it has already been added in
I guess that eventually everything will go through the eviction handler. Until then there is absolutely no possibility for the PDB controller to have a super-current status, no matter how many extra fields we add there. Right now we try to guarantee that if the eviction handler is not bypassed (direct pod deletions, node crashes, etc.) then PDB works as intended.
Ok, observedGeneration/generation changes in this PR LGTM.
@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]
Automatic merge from submit-queue
Typo from 8e2347370e (Add observedGeneration to PodDisruptionBudgetStatus, 2016-11-08, #36438).
Fixes #35324
Previously it was possible to use allowedDisruptions calculated for a previous spec with the current spec. With the generation check, API servers always make sure that allowedDisruptions were calculated for the current spec.
At the same time I set the registry policy to only accept updates if the version based on which the update was made matches the current version in etcd. That ensures that parallel eviction executions don't use the same allowed disruption.
cc: @davidopp @Kargakis @wojtek-t